Two steps to modeling
Step 1: Identify a family of models which express a generic pattern between your variables of interest.
Possible model family: Linear model, i.e. \(child = a_1 + a_2 \times parent\).
- Variables: child and parent
- Model parameters: \(a_1\) and \(a_2\)
Many other possible models: linear without intercept, quadratic, exponential, …
Step 2: Fit the model, i.e. choose the member of the family that most closely matches the data.
What do we mean by “closely matching the data”?
We choose \(a_1\) and \(a_2\) such that some objective function (loss function) is minimized.
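Most common objective: Minimize the sum of squared residuals, i.e. the squared vertical distances between each data point and the fitted line (drawn as black lines in the accompanying plot).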
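A minimal sketch of the sum-of-squares objective, assuming a small made-up data frame (the numbers are for illustration only and are not the course's data). The helper sum_of_squares is a hypothetical name; it evaluates the loss for any candidate pair \((a_1, a_2)\), and the loss at the lm estimates should be smaller than at a nudged pair.

```r
# Toy data, made up for illustration (not the course's data set)
df <- data.frame(
  parent = c(64, 66, 68, 70, 72),
  child  = c(65, 66, 68, 69, 71)
)

# Hypothetical helper: sum of squared residuals for a candidate
# intercept a[1] and slope a[2]
sum_of_squares <- function(a, data) {
  predicted <- a[1] + a[2] * data$parent
  sum((data$child - predicted)^2)
}

# lm() returns the intercept and slope that minimize this loss
fit  <- lm(child ~ parent, data = df)
best <- coef(fit)

loss_best   <- sum_of_squares(best, df)               # loss at the fitted parameters
loss_nudged <- sum_of_squares(best + c(0, 0.01), df)  # slope nudged slightly

loss_best < loss_nudged
```

Any other choice of \(a_1\) and \(a_2\) gives a loss at least as large as loss_best.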
Linear models in R
- Linear regression can be done with the lm function
- Syntax: lm(formula, data = df)
- Formulas look like y ~ x, which lm will translate to a function like \(y = a_1 + a_2 \cdot x\)
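A short sketch of the lm syntax, assuming a made-up data frame df with columns x and y (illustrative values only). It also shows one way to write the "linear without intercept" family with a formula.

```r
# Toy data frame, made up for illustration
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit y = a_1 + a_2 * x
fit <- lm(y ~ x, data = df)
coef(fit)  # "(Intercept)" is a_1, "x" is a_2

# Predict y for a new value of x
pred <- predict(fit, newdata = data.frame(x = 6))

# Linear model without intercept: y = a_2 * x
fit0 <- lm(y ~ 0 + x, data = df)
coef(fit0)
```

summary(fit) prints standard errors and \(R^2\) in addition to the coefficients.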
Models with categorical variables
Consider modeling valence ~ mode.
- Does the model \(valence = a_1 + a_2 \cdot mode\) make sense?
- 3 + 4 \(\cdot\) “Major”??
- What R does:
- Choose a baseline category (say, “Minor”)
- Model \(valence = a_1 + a_2 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1\) if Minor, \(valence = a_1 + a_2\) if Major
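A minimal sketch of the indicator coding, assuming a made-up songs data frame (illustrative values, not the course's data). R uses the first factor level as the baseline, and by default levels are sorted alphabetically, so the levels are set explicitly here to make "Minor" the baseline.

```r
# Toy data, made up for illustration
songs <- data.frame(
  valence = c(0.30, 0.40, 0.35, 0.60, 0.70, 0.65),
  mode    = factor(
    c("Minor", "Minor", "Minor", "Major", "Major", "Major"),
    levels = c("Minor", "Major")  # first level is the baseline
  )
)

fit <- lm(valence ~ mode, data = songs)
coef(fit)  # "(Intercept)" is a_1, "modeMajor" is a_2

# The 0/1 column that R builds for mode
model.matrix(fit)

# a_1 is the mean valence for Minor; a_1 + a_2 is the mean for Major
group_means <- tapply(songs$valence, songs$mode, mean)
group_means
```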
Additive models
Formula valence ~ loudness + mode translates to
- \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1 + a_2 \cdot loudness\) if Minor
- \(valence = (a_1 + a_3) + a_2 \cdot loudness\) if Major
- Same gradient, different intercept
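A sketch of the additive model, again assuming a made-up songs data frame with "Minor" set as the baseline level. The names minor_line and major_line are hypothetical; they collect the intercept and gradient of each group's line from the fitted coefficients.

```r
# Toy data, made up for illustration
songs <- data.frame(
  valence  = c(0.20, 0.35, 0.50, 0.45, 0.55, 0.75),
  loudness = c(-12, -8, -4, -11, -7, -3),
  mode     = factor(
    c("Minor", "Minor", "Minor", "Major", "Major", "Major"),
    levels = c("Minor", "Major")  # "Minor" is the baseline
  )
)

fit <- lm(valence ~ loudness + mode, data = songs)
a <- coef(fit)  # "(Intercept)" = a_1, "loudness" = a_2, "modeMajor" = a_3

# One line per mode: shared gradient a_2, intercepts a_1 and a_1 + a_3
minor_line <- c(intercept = a[["(Intercept)"]],
                slope     = a[["loudness"]])
major_line <- c(intercept = a[["(Intercept)"]] + a[["modeMajor"]],
                slope     = a[["loudness"]])

minor_line
major_line
```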
Models with interaction
Formula valence ~ loudness * mode translates to
- \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor + \color{blue}{a_4 \cdot loudness \cdot modeMajor}\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1 + a_2 \cdot loudness\) if Minor
- \(valence = (a_1 + a_3) + (a_2 + a_4) \cdot loudness\) if Major
- Different gradient, different intercept
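A sketch of the interaction model on a made-up songs data frame (illustrative values, "Minor" as baseline). The names minor_line and major_line are hypothetical. With a full interaction, each mode's line matches what a separate regression on that mode alone would give, which the last lines compute for comparison.

```r
# Toy data, made up for illustration
songs <- data.frame(
  valence  = c(0.20, 0.35, 0.50, 0.45, 0.55, 0.75),
  loudness = c(-12, -8, -4, -11, -7, -3),
  mode     = factor(
    c("Minor", "Minor", "Minor", "Major", "Major", "Major"),
    levels = c("Minor", "Major")  # "Minor" is the baseline
  )
)

fit <- lm(valence ~ loudness * mode, data = songs)
a <- coef(fit)
# "(Intercept)" = a_1, "loudness" = a_2,
# "modeMajor" = a_3, "loudness:modeMajor" = a_4

minor_line <- c(intercept = a[["(Intercept)"]],
                slope     = a[["loudness"]])
major_line <- c(intercept = a[["(Intercept)"]] + a[["modeMajor"]],
                slope     = a[["loudness"]] + a[["loudness:modeMajor"]])

# Separate fits on each mode, for comparison with the lines above
minor_only <- lm(valence ~ loudness, data = songs[songs$mode == "Minor", ])
major_only <- lm(valence ~ loudness, data = songs[songs$mode == "Major", ])
```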
Summary of the course
- Variable types
- Basic objects in R (vectors, lists, data frames)
- Plotting data with ggplot2
- Transforming and joining data with dplyr
- Importing and exporting data
- R projects, R scripts and R markdown
- Making maps
- Basic statistical testing and modeling
Where do we go from here?
- Read R for Data Science from cover to cover!
- Take short courses on DataCamp
- Writing your own functions and running simulations
- Advanced mapping with ggmap
- Advanced regression models with glmnet
- Interactive web apps with shiny
- Text analysis with tidytext
- Recommended text: Text Mining with R by Julia Silge and David Robinson (available online for free at tidytextmining.com)
Other Stanford courses
- Programming: CS 106A
- Statistical methods: STATS 60, STATS 101
Thank you! :)